Message ID | 20201211152649.12123-3-maximmi@mellanox.com (mailing list archive)
---|---
State | Changes Requested
Delegated to: | Netdev Maintainers
Series | HTB offload
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net-next |
netdev/subject_prefix | success | Link |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 7662 this patch: 7662 |
netdev/kdoc | success | Errors and warnings before: 38 this patch: 38 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 723 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 7772 this patch: 7772 |
netdev/header_inline | success | Link |
netdev/stable | success | Stable not CCed |
On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> HTB doesn't scale well because of contention on a single lock, and it
> also consumes CPU. This patch adds support for offloading HTB to
> hardware that supports hierarchical rate limiting.
>
> This solution addresses two main problems of scaling HTB:
>
> 1. Contention by flow classification. Currently the filters are attached
> to the HTB instance as follows:

I do not think this is the reason, tcf_classify() has been called with RCU
only on the ingress side for a rather long time. What contentions are you
talking about here?

>
> # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
>   classid 1:10
>
> It's possible to move classification to clsact egress hook, which is
> thread-safe and lock-free:
>
> # tc filter add dev eth0 egress protocol ip flower dst_port 80
>   action skbedit priority 1:10
>
> This way classification still happens in software, but the lock
> contention is eliminated, and it happens before selecting the TX queue,
> allowing the driver to translate the class to the corresponding hardware
> queue.

Sure, you can use clsact with HTB, or any combinations you like, but you
can't assume your HTB only works with clsact, can you?

>
> Note that this is already compatible with non-offloaded HTB and doesn't
> require changes to the kernel nor iproute2.
>
> 2. Contention by handling packets. HTB is not multi-queue, it attaches
> to a whole net device, and handling of all packets takes the same lock.
> When HTB is offloaded, its algorithm is done in hardware. HTB registers
> itself as a multi-queue qdisc, similarly to mq: HTB is attached to the
> netdev, and each queue has its own qdisc. The control flow is still done
> by HTB: it calls the driver via ndo_setup_tc to replicate the hierarchy
> of classes in the NIC. Leaf classes are presented by hardware queues.
> The data path works as follows: a packet is classified by clsact, the
> driver selects a hardware queue according to its class, and the packet
> is enqueued into this queue's qdisc.

I have _not_ read your code; from what you describe here, it sounds like
you just want a per-queue rate limit, instead of a global one. So why
bother with HTB, whose goal is a global rate limit? And doesn't TBF
already work with mq? I mean you can attach it as a leaf to each mq so
that the tree lock will not be shared either, but you'd lose the
benefits of a global rate limit too. EDT does basically the same, but it
never claims to completely replace HTB. ;)

Thanks.
On 2020-12-11 21:16, Cong Wang wrote: > On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: >> >> HTB doesn't scale well because of contention on a single lock, and it >> also consumes CPU. This patch adds support for offloading HTB to >> hardware that supports hierarchical rate limiting. >> >> This solution addresses two main problems of scaling HTB: >> >> 1. Contention by flow classification. Currently the filters are attached >> to the HTB instance as follows: > > I do not think this is the reason, tcf_classify() has been called with RCU > only on the ingress side for a rather long time. What contentions are you > talking about here? When one attaches filters to HTB, tcf_classify is called from htb_classify, which is called from htb_enqueue, which is called with the root spinlock of the qdisc taken. >> >> # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 >> classid 1:10 >> >> It's possible to move classification to clsact egress hook, which is >> thread-safe and lock-free: >> >> # tc filter add dev eth0 egress protocol ip flower dst_port 80 >> action skbedit priority 1:10 >> >> This way classification still happens in software, but the lock >> contention is eliminated, and it happens before selecting the TX queue, >> allowing the driver to translate the class to the corresponding hardware >> queue. > > Sure, you can use clsact with HTB, or any combinations you like, but you > can't assume your HTB only works with clsact, can you? The goal is to eliminate the root lock from the datapath, and the traditional filters attached to the HTB itself are handled under that lock. I believe it's a sane limitation, given that the offloaded mode is a new mode of operation, it's opt-in, and it may also have additional hardware-imposed limitations. > >> >> Note that this is already compatible with non-offloaded HTB and doesn't >> require changes to the kernel nor iproute2. >> >> 2. Contention by handling packets. 
>> HTB is not multi-queue, it attaches
>> to a whole net device, and handling of all packets takes the same lock.
>> When HTB is offloaded, its algorithm is done in hardware. HTB registers
>> itself as a multi-queue qdisc, similarly to mq: HTB is attached to the
>> netdev, and each queue has its own qdisc. The control flow is still done
>> by HTB: it calls the driver via ndo_setup_tc to replicate the hierarchy
>> of classes in the NIC. Leaf classes are presented by hardware queues.
>> The data path works as follows: a packet is classified by clsact, the
>> driver selects a hardware queue according to its class, and the packet
>> is enqueued into this queue's qdisc.
>
> I have _not_ read your code; from what you describe here, it sounds like
> you just want a per-queue rate limit, instead of a global one. So why
> bother with HTB, whose goal is a global rate limit?

I would disagree. HTB's goal is hierarchical rate limits with borrowing.
Sure, it can be used just to set a global limit, but its main purpose is
creating a hierarchy of classes. And yes, we are talking about the whole
netdevice here, not about per-queue limits (we already support per-queue
rate limits by means of tx_maxrate, so we wouldn't need any new code for
that). The tree of classes is global for the whole netdevice, with
hierarchy and borrowing supported. These additional send queues can be
considered hardware objects that represent offloaded leaf traffic classes
(which can be extended to multiple queues per class). So, we are really
offloading HTB functionality here, not just using the HTB interface for
something else (something simpler). I hope it sounds better for you now.

> And doesn't TBF already work with mq? I mean you can attach it as
> a leaf to each mq so that the tree lock will not be shared either, but you'd
> lose the benefits of a global rate limit too.
Yes, I'd lose not only the global rate limit, but also multi-level hierarchical limits, which are all provided by this HTB offload - that's why TBF is not really a replacement for this feature. > EDT does basically the same, > but it never claims to completely replace HTB. ;) > > Thanks. >
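The TBF-under-mq alternative discussed here can be written down concretely (a sketch; the device name, queue count, and rates are illustrative, and it needs a multi-queue NIC plus root privileges to run). Each queue gets its own independent token bucket, so the root lock is no longer shared — but four queues capped at 100mbit can together emit 400mbit, which is exactly the loss of the global limit and of borrowing that this exchange is about:

```shell
# Per-queue TBF under mq: no shared qdisc lock, but no global limit either.
tc qdisc add dev eth0 root handle 1: mq

# One independent TBF per hardware queue (assume 4 queues).
for i in 1 2 3 4; do
    tc qdisc add dev eth0 parent 1:$i \
        tbf rate 100mbit burst 64kb latency 50ms
done
```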
On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote: > > On 2020-12-11 21:16, Cong Wang wrote: > > On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: > >> > >> HTB doesn't scale well because of contention on a single lock, and it > >> also consumes CPU. This patch adds support for offloading HTB to > >> hardware that supports hierarchical rate limiting. > >> > >> This solution addresses two main problems of scaling HTB: > >> > >> 1. Contention by flow classification. Currently the filters are attached > >> to the HTB instance as follows: > > > > I do not think this is the reason, tcf_classify() has been called with RCU > > only on the ingress side for a rather long time. What contentions are you > > talking about here? > > When one attaches filters to HTB, tcf_classify is called from > htb_classify, which is called from htb_enqueue, which is called with the > root spinlock of the qdisc taken. So it has nothing to do with tcf_classify() itself... :-/ [...] > > And doesn't TBF already work with mq? I mean you can attach it as > > a leaf to each mq so that the tree lock will not be shared either, but you'd > > lose the benefits of a global rate limit too. > > Yes, I'd lose not only the global rate limit, but also multi-level > hierarchical limits, which are all provided by this HTB offload - that's > why TBF is not really a replacement for this feature. Interesting, please explain how your HTB offload still has a global rate limit and borrowing across queues? I simply can't see it, all I can see is you offload HTB into each queue in ->attach(), where I assume the hardware will do rate limit on each queue, if the hardware also has a global control, why it is not reflected on the root qdisc? Thanks!
On 2020-12-14 21:35, Cong Wang wrote: > On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote: >> >> On 2020-12-11 21:16, Cong Wang wrote: >>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: >>>> >>>> HTB doesn't scale well because of contention on a single lock, and it >>>> also consumes CPU. This patch adds support for offloading HTB to >>>> hardware that supports hierarchical rate limiting. >>>> >>>> This solution addresses two main problems of scaling HTB: >>>> >>>> 1. Contention by flow classification. Currently the filters are attached >>>> to the HTB instance as follows: >>> >>> I do not think this is the reason, tcf_classify() has been called with RCU >>> only on the ingress side for a rather long time. What contentions are you >>> talking about here? >> >> When one attaches filters to HTB, tcf_classify is called from >> htb_classify, which is called from htb_enqueue, which is called with the >> root spinlock of the qdisc taken. > > So it has nothing to do with tcf_classify() itself... :-/ > > [...] > >>> And doesn't TBF already work with mq? I mean you can attach it as >>> a leaf to each mq so that the tree lock will not be shared either, but you'd >>> lose the benefits of a global rate limit too. >> >> Yes, I'd lose not only the global rate limit, but also multi-level >> hierarchical limits, which are all provided by this HTB offload - that's >> why TBF is not really a replacement for this feature. > > Interesting, please explain how your HTB offload still has a global rate > limit and borrowing across queues? Sure, I will explain that. > I simply can't see it, all I can see > is you offload HTB into each queue in ->attach(), In the non-offload mode, the same HTB instance would be attached to all queues. In the offload mode, HTB behaves like MQ: there is a root instance of HTB, but each queue gets a separate simple qdisc (pfifo). 
Only the root qdisc (HTB) gets offloaded, and when that happens, the NIC creates an object for the QoS root. Then all configuration changes are sent to the driver, and it issues the corresponding firmware commands to replicate the whole hierarchy in the NIC. Leaf classes correspond to queue groups (in this implementation queue groups contain only one queue, but it can be extended), and inner classes correspond to entities called TSARs. The information about rate limits is stored inside TSARs and queue groups. Queues know what groups they belong to, and groups and TSARs know what TSAR is their parent. A queue is picked in ndo_select_queue by looking at the classification result of clsact. So, when a packet is put onto a queue, the NIC can track the whole hierarchy and do the HTB algorithm. > where I assume the > hardware will do rate limit on each queue, So, it's not flat in the NIC, and rate limiting is done in a hierarchical way. > if the hardware also has a > global control, why it is not reflected on the root qdisc? I'm not sure if I got this last question correctly. The root qdisc is HTB, and all the configuration of the HTB tree gets reflected in the NIC, as I just explained. I hope now it's clearer, but if you still have questions, I'm glad to explain more details (also, I'm ready to respin with the minor fixes for the CI build issue on parisc). Thanks, Max > Thanks! >
On 2020-12-14 3:30 p.m., Maxim Mikityanskiy wrote:
> On 2020-12-14 21:35, Cong Wang wrote:
>> On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy
>> <maximmi@nvidia.com> wrote:
>>>
>>> On 2020-12-11 21:16, Cong Wang wrote:
>>>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy
>>>> <maximmi@mellanox.com> wrote:
>>>>>
>>
>> Interesting, please explain how your HTB offload still has a global rate
>> limit and borrowing across queues?
>
> Sure, I will explain that.
>
>> I simply can't see it, all I can see
>> is you offload HTB into each queue in ->attach(),
>
> In the non-offload mode, the same HTB instance would be attached to all
> queues. In the offload mode, HTB behaves like MQ: there is a root
> instance of HTB, but each queue gets a separate simple qdisc (pfifo).
> Only the root qdisc (HTB) gets offloaded, and when that happens, the NIC
> creates an object for the QoS root.
>
> Then all configuration changes are sent to the driver, and it issues the
> corresponding firmware commands to replicate the whole hierarchy in the
> NIC. Leaf classes correspond to queue groups (in this implementation
> queue groups contain only one queue, but it can be extended),

FWIW, it is very valuable to be able to abstract HTB if the hardware
can emulate it (users don't have to learn new abstractions).

Since you are expressing a limitation above:
How does the user discover if they over-provisioned, i.e. the single
queue example above? If there are too many corner cases it may
make sense to just create a new qdisc.

> and inner
> classes correspond to entities called TSARs.
>
> The information about rate limits is stored inside TSARs and queue
> groups. Queues know what groups they belong to, and groups and TSARs
> know what TSAR is their parent. A queue is picked in ndo_select_queue by
> looking at the classification result of clsact. So, when a packet is put
> onto a queue, the NIC can track the whole hierarchy and do the HTB
> algorithm.
>

Same question above:
Is there a limit to the number of classes that can be created?
IOW, if someone just created an arbitrary number of queues, do they
get errored out if it doesn't make sense for the hardware?

If such limits exist, it may make sense to provide a knob to query
(maybe ethtool) and if such limits can be adjusted it may be worth
looking at providing interfaces via devlink.

cheers,
jamal
On 2020-12-15 18:37, Jamal Hadi Salim wrote:
> On 2020-12-14 3:30 p.m., Maxim Mikityanskiy wrote:
>> On 2020-12-14 21:35, Cong Wang wrote:
>>> On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy
>>> <maximmi@nvidia.com> wrote:
>>>>
>>>> On 2020-12-11 21:16, Cong Wang wrote:
>>>>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy
>>>>> <maximmi@mellanox.com> wrote:
>>>>>>
>>>
>>> Interesting, please explain how your HTB offload still has a global rate
>>> limit and borrowing across queues?
>>
>> Sure, I will explain that.
>>
>>> I simply can't see it, all I can see
>>> is you offload HTB into each queue in ->attach(),
>>
>> In the non-offload mode, the same HTB instance would be attached to
>> all queues. In the offload mode, HTB behaves like MQ: there is a root
>> instance of HTB, but each queue gets a separate simple qdisc (pfifo).
>> Only the root qdisc (HTB) gets offloaded, and when that happens, the
>> NIC creates an object for the QoS root.
>>
>> Then all configuration changes are sent to the driver, and it issues
>> the corresponding firmware commands to replicate the whole hierarchy
>> in the NIC. Leaf classes correspond to queue groups (in this
>> implementation queue groups contain only one queue, but it can be
>> extended),
>
> FWIW, it is very valuable to be able to abstract HTB if the hardware
> can emulate it (users don't have to learn new abstractions).

Yes, that's the reason for using an existing interface (HTB) to
configure the feature.

> Since you are expressing a limitation above:
> How does the user discover if they over-provisioned, i.e. the single
> queue example above?

It comes down to CPU usage. If the core that serves the queue is busy
with sending packets 100% of the time, you need more queues. Also, if
the user runs more than one application belonging to the same class, and
pins them to different cores, it makes sense to create more than one
queue.

I'd like to emphasize that this is not a hard limitation.
Our hardware and firmware supports multiple queues per class. What's needed is the support from the driver side and probably an additional parameter to tc class add to specify the number of queues to reserve. > If there are too many corner cases it may > make sense to just create a new qdisc. > >> and inner classes correspond to entities called TSARs. >> >> The information about rate limits is stored inside TSARs and queue >> groups. Queues know what groups they belong to, and groups and TSARs >> know what TSAR is their parent. A queue is picked in ndo_select_queue >> by looking at the classification result of clsact. So, when a packet >> is put onto a queue, the NIC can track the whole hierarchy and do the >> HTB algorithm. >> > > Same question above: > Is there a limit to the number of classes that can be created? Yes, the commit message of the mlx5 patch lists the limitations of our NICs. Basically, it's 256 leaf classes and 3 levels of hierarchy. > IOW, if someone just created an arbitrary number of queues do they > get errored-out if it doesnt make sense for the hardware? The current implementation starts failing gracefully if the limits are exceeded. The tc command won't succeed, and everything will roll back to the stable state, which was just before the tc command. > If such limits exist, it may make sense to provide a knob to query > (maybe ethtool) Sounds legit, but I'm not sure what would be the best interface for that. Ethtool is not involved at all in this implementation, and AFAIK it doesn't contain any existing command for similar stuff. We could hook into set-channels and add new type of channels for HTB, but the semantics isn't very clear, because HTB queues != HTB leaf classes, and I don't know if it's allowed to extend this interface (if so, I have more thoughts of extending it for other purposes). > and if such limits can be adjusted it may be worth > looking at providing interfaces via devlink. Not really. 
At the moment, there isn't a good reason to decrease the maximum limits.
It would make sense if it could free up some resources for something
else, but AFAIK it's not the case now.

Thanks,
Max

> cheers,
> jamal
On Mon, Dec 14, 2020 at 12:30 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote: > > On 2020-12-14 21:35, Cong Wang wrote: > > On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote: > >> > >> On 2020-12-11 21:16, Cong Wang wrote: > >>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: > >>>> > >>>> HTB doesn't scale well because of contention on a single lock, and it > >>>> also consumes CPU. This patch adds support for offloading HTB to > >>>> hardware that supports hierarchical rate limiting. > >>>> > >>>> This solution addresses two main problems of scaling HTB: > >>>> > >>>> 1. Contention by flow classification. Currently the filters are attached > >>>> to the HTB instance as follows: > >>> > >>> I do not think this is the reason, tcf_classify() has been called with RCU > >>> only on the ingress side for a rather long time. What contentions are you > >>> talking about here? > >> > >> When one attaches filters to HTB, tcf_classify is called from > >> htb_classify, which is called from htb_enqueue, which is called with the > >> root spinlock of the qdisc taken. > > > > So it has nothing to do with tcf_classify() itself... :-/ > > > > [...] > > > >>> And doesn't TBF already work with mq? I mean you can attach it as > >>> a leaf to each mq so that the tree lock will not be shared either, but you'd > >>> lose the benefits of a global rate limit too. > >> > >> Yes, I'd lose not only the global rate limit, but also multi-level > >> hierarchical limits, which are all provided by this HTB offload - that's > >> why TBF is not really a replacement for this feature. > > > > Interesting, please explain how your HTB offload still has a global rate > > limit and borrowing across queues? > > Sure, I will explain that. > > > I simply can't see it, all I can see > > is you offload HTB into each queue in ->attach(), > > In the non-offload mode, the same HTB instance would be attached to all > queues. 
In the offload mode, HTB behaves like MQ: there is a root > instance of HTB, but each queue gets a separate simple qdisc (pfifo). > Only the root qdisc (HTB) gets offloaded, and when that happens, the NIC > creates an object for the QoS root. Please add this to your changelog. And why is the offloaded root qdisc not visible to software? All you add to root HTB are pointers of direct qdisc's and a boolean, this is what I meant by "not reflected". I expect the hardware parameters/stats are exposed to software too, but I can't find any. > > Then all configuration changes are sent to the driver, and it issues the > corresponding firmware commands to replicate the whole hierarchy in the > NIC. Leaf classes correspond to queue groups (in this implementation > queue groups contain only one queue, but it can be extended), and inner > classes correspond to entities called TSARs. > > The information about rate limits is stored inside TSARs and queue > groups. Queues know what groups they belong to, and groups and TSARs > know what TSAR is their parent. A queue is picked in ndo_select_queue by > looking at the classification result of clsact. So, when a packet is put > onto a queue, the NIC can track the whole hierarchy and do the HTB > algorithm. Glad to know hardware still keeps HTB as a hierarchy. Please also add this either to source code as comments or in your changelog, it is very important to understand what is done by hardware. Thanks.
On 2020-12-16 21:01, Cong Wang wrote: > On Mon, Dec 14, 2020 at 12:30 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote: >> >> On 2020-12-14 21:35, Cong Wang wrote: >>> On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote: >>>> >>>> On 2020-12-11 21:16, Cong Wang wrote: >>>>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote: >>>>>> >>>>>> HTB doesn't scale well because of contention on a single lock, and it >>>>>> also consumes CPU. This patch adds support for offloading HTB to >>>>>> hardware that supports hierarchical rate limiting. >>>>>> >>>>>> This solution addresses two main problems of scaling HTB: >>>>>> >>>>>> 1. Contention by flow classification. Currently the filters are attached >>>>>> to the HTB instance as follows: >>>>> >>>>> I do not think this is the reason, tcf_classify() has been called with RCU >>>>> only on the ingress side for a rather long time. What contentions are you >>>>> talking about here? >>>> >>>> When one attaches filters to HTB, tcf_classify is called from >>>> htb_classify, which is called from htb_enqueue, which is called with the >>>> root spinlock of the qdisc taken. >>> >>> So it has nothing to do with tcf_classify() itself... :-/ >>> >>> [...] >>> >>>>> And doesn't TBF already work with mq? I mean you can attach it as >>>>> a leaf to each mq so that the tree lock will not be shared either, but you'd >>>>> lose the benefits of a global rate limit too. >>>> >>>> Yes, I'd lose not only the global rate limit, but also multi-level >>>> hierarchical limits, which are all provided by this HTB offload - that's >>>> why TBF is not really a replacement for this feature. >>> >>> Interesting, please explain how your HTB offload still has a global rate >>> limit and borrowing across queues? >> >> Sure, I will explain that. 
>>
>>> I simply can't see it, all I can see
>>> is you offload HTB into each queue in ->attach(),
>>
>> In the non-offload mode, the same HTB instance would be attached to all
>> queues. In the offload mode, HTB behaves like MQ: there is a root
>> instance of HTB, but each queue gets a separate simple qdisc (pfifo).
>> Only the root qdisc (HTB) gets offloaded, and when that happens, the NIC
>> creates an object for the QoS root.
>
> Please add this to your changelog.

Similar information is already in the commit message (the paragraph
under point 2.), but I can rephrase it and elaborate on this, focusing
on the interaction between HTB, driver and hardware.

> And why is the offloaded root qdisc not visible to software? All you add to
> root HTB are pointers of direct qdisc's and a boolean, this is what I meant
> by "not reflected". I expect the hardware parameters/stats are exposed to
> software too, but I can't find any.

Hardware parameters are rate and ceil; there are no extra parameters
that exist only in the offload mode. In the future, we may add the
number of queues to reserve for a class - that will be configured and
reflected in tc.

Regarding stats, unfortunately, the hardware doesn't expose any
low-level stats like numbers of tokens, etc. Basic stats like packets
and bytes are supported and exposed in tc, but they are calculated in
software (by per-queue qdiscs) - similarly to how we calculate
per-TX-queue packets and bytes in software.

>
>>
>> Then all configuration changes are sent to the driver, and it issues the
>> corresponding firmware commands to replicate the whole hierarchy in the
>> NIC. Leaf classes correspond to queue groups (in this implementation
>> queue groups contain only one queue, but it can be extended), and inner
>> classes correspond to entities called TSARs.
>>
>> The information about rate limits is stored inside TSARs and queue
>> groups.
Queues know what groups they belong to, and groups and TSARs >> know what TSAR is their parent. A queue is picked in ndo_select_queue by >> looking at the classification result of clsact. So, when a packet is put >> onto a queue, the NIC can track the whole hierarchy and do the HTB >> algorithm. > > Glad to know hardware still keeps HTB as a hierarchy. > > Please also add this either to source code as comments or in your > changelog, it is very important to understand what is done by hardware. OK, as I said above in this letter, I can rephrase the commit message, focusing on details about interaction between HTB <-> driver <-> NIC and what happens in the NIC. It should look better if I put it in a dedicated paragraph, instead of mentioning it under the contention problem. Thanks, Max > Thanks. >
On 2020-12-16 6:47 a.m., Maxim Mikityanskiy wrote:
> On 2020-12-15 18:37, Jamal Hadi Salim wrote:

[..]

>>
>> Same question above:
>> Is there a limit to the number of classes that can be created?
>
> Yes, the commit message of the mlx5 patch lists the limitations of our
> NICs. Basically, it's 256 leaf classes and 3 levels of hierarchy.
>

Ok, that's what I was looking for.

>> IOW, if someone just created an arbitrary number of queues do they
>> get errored out if it doesn't make sense for the hardware?
>
> The current implementation starts failing gracefully if the limits are
> exceeded. The tc command won't succeed, and everything will roll back to
> the stable state, which was just before the tc command.
>

Does the user get notified somehow, or does it fail silently?
An extack message would help.

>> If such limits exist, it may make sense to provide a knob to query
>> (maybe ethtool)
>
> Sounds legit, but I'm not sure what would be the best interface for
> that. Ethtool is not involved at all in this implementation, and AFAIK
> it doesn't contain any existing command for similar stuff. We could hook
> into set-channels and add new type of channels for HTB, but the
> semantics isn't very clear, because HTB queues != HTB leaf classes, and
> I don't know if it's allowed to extend this interface (if so, I have
> more thoughts of extending it for other purposes).
>

More looking to make sure no surprise to the user. Either the user can
discover what the constraints are, or when they provision they get a
message like "cannot offload more than 3 hierarchies" or "use devlink
if you want to use more than 256 classes", etc.

cheers,
jamal
On 2020-12-17 17:09, Jamal Hadi Salim wrote:
> On 2020-12-16 6:47 a.m., Maxim Mikityanskiy wrote:
>> On 2020-12-15 18:37, Jamal Hadi Salim wrote:
>
> [..]
>
>>>
>>> Same question above:
>>> Is there a limit to the number of classes that can be created?
>>
>> Yes, the commit message of the mlx5 patch lists the limitations of our
>> NICs. Basically, it's 256 leaf classes and 3 levels of hierarchy.
>>
>
> Ok, that's what I was looking for.
>
>>> IOW, if someone just created an arbitrary number of queues do they
>>> get errored out if it doesn't make sense for the hardware?
>>
>> The current implementation starts failing gracefully if the limits are
>> exceeded. The tc command won't succeed, and everything will roll back
>> to the stable state, which was just before the tc command.
>>
>
> Does the user get notified somehow, or does it fail silently?
> An extack message would help.

The current implementation doesn't use extack, just returns an error
code, because many callbacks to the qdisc don't get extack as a
parameter. However, I agree with you, these messages would be helpful
for the user when an operation fails due to hardware limitations - it
will be easier than guessing what caused an EINVAL, so I'll add them. I
will review which callbacks lack an extack, and I might add it if it's
meaningful for them.
> >>> If such limits exist, it may make sense to provide a knob to query >>> (maybe ethtool) >> >> Sounds legit, but I'm not sure what would be the best interface for >> that. Ethtool is not involved at all in this implementation, and AFAIK >> it doesn't contain any existing command for similar stuff. We could >> hook into set-channels and add new type of channels for HTB, but the >> semantics isn't very clear, because HTB queues != HTB leaf classes, >> and I don't know if it's allowed to extend this interface (if so, I >> have more thoughts of extending it for other purposes). >> > > More looking to make sure no suprise to the user. Either the user can > discover what the constraints are or when they provision they get a > a message like "cannot offload more than 3 hierarchies" or "use devlink > if you want to use more than 256 classes", etc. Yes, it makes perfect sense. Messages are even more user-friendly, as for me. So, I'll add such messages to extack, and as the limitations are driver-specific, I'll pass extack to the driver. I will respin when net-next reopens, in the meanwhile comments are welcome. Thanks, Max > > cheers, > jamal
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7bf167993c05..6830a8d2dbe9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -859,6 +859,7 @@ enum tc_setup_type {
 	TC_SETUP_QDISC_ETS,
 	TC_SETUP_QDISC_TBF,
 	TC_SETUP_QDISC_FIFO,
+	TC_SETUP_QDISC_HTB,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 0f2a9c44171c..73ec1639365b 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -783,6 +783,39 @@ struct tc_mq_qopt_offload {
 	};
 };
 
+enum tc_htb_command {
+	/* Root */
+	TC_HTB_CREATE, /* Initialize HTB offload. */
+	TC_HTB_DESTROY, /* Destroy HTB offload. */
+
+	/* Classes */
+	/* Allocate qid and create leaf. */
+	TC_HTB_LEAF_ALLOC_QUEUE,
+	/* Convert leaf to inner, preserve and return qid, create new leaf. */
+	TC_HTB_LEAF_TO_INNER,
+	/* Delete leaf, while siblings remain. */
+	TC_HTB_LEAF_DEL,
+	/* Delete leaf, convert parent to leaf, preserving qid. */
+	TC_HTB_LEAF_DEL_LAST,
+	/* Modify parameters of a node. */
+	TC_HTB_NODE_MODIFY,
+
+	/* Class qdisc */
+	TC_HTB_LEAF_QUERY_QUEUE, /* Query qid by classid. */
+};
+
+struct tc_htb_qopt_offload {
+	enum tc_htb_command command;
+	u16 classid;
+	u32 parent_classid;
+	u16 qid;
+	u16 moved_qid;
+	u64 rate;
+	u64 ceil;
+};
+
+#define TC_HTB_CLASSID_ROOT U32_MAX
+
 enum tc_red_command {
 	TC_RED_REPLACE,
 	TC_RED_DESTROY,
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 9e7c2c607845..79a699f106b1 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -434,6 +434,7 @@ enum {
 	TCA_HTB_RATE64,
 	TCA_HTB_CEIL64,
 	TCA_HTB_PAD,
+	TCA_HTB_OFFLOAD,
 	__TCA_HTB_MAX,
 };
 
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index cd70dbcbd72f..fccdce591104 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -174,6 +174,11 @@ struct htb_sched {
 	int			row_mask[TC_HTB_MAXDEPTH];
 
 	struct htb_level	hlevel[TC_HTB_MAXDEPTH];
+
+	struct Qdisc		**direct_qdiscs;
+	unsigned int		num_direct_qdiscs;
+
+	bool			offload;
 };
 
 /* find class in global hash table using given handle */
@@ -957,7 +962,7 @@ static void htb_reset(struct Qdisc *sch)
 			if (cl->level)
 				memset(&cl->inner, 0, sizeof(cl->inner));
 			else {
-				if (cl->leaf.q)
+				if (cl->leaf.q && !q->offload)
 					qdisc_reset(cl->leaf.q);
 			}
 			cl->prio_activity = 0;
@@ -980,6 +985,7 @@ static const struct nla_policy htb_policy[TCA_HTB_MAX + 1] = {
 	[TCA_HTB_DIRECT_QLEN] = { .type = NLA_U32 },
 	[TCA_HTB_RATE64] = { .type = NLA_U64 },
 	[TCA_HTB_CEIL64] = { .type = NLA_U64 },
+	[TCA_HTB_OFFLOAD] = { .type = NLA_FLAG },
 };
 
 static void htb_work_func(struct work_struct *work)
@@ -992,12 +998,27 @@ static void htb_work_func(struct work_struct *work)
 	rcu_read_unlock();
 }
 
+static void htb_set_lockdep_class_child(struct Qdisc *q)
+{
+	static struct lock_class_key child_key;
+
+	lockdep_set_class(qdisc_lock(q), &child_key);
+}
+
+static int htb_offload(struct net_device *dev, struct tc_htb_qopt_offload *opt)
+{
+	return dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_HTB, opt);
+}
+
 static int htb_init(struct Qdisc *sch, struct nlattr *opt,
 		    struct netlink_ext_ack *extack)
 {
+	struct net_device *dev = qdisc_dev(sch);
+	struct tc_htb_qopt_offload offload_opt;
 	struct htb_sched *q = qdisc_priv(sch);
 	struct nlattr *tb[TCA_HTB_MAX + 1];
 	struct tc_htb_glob *gopt;
+	unsigned int ntx;
 	int err;
 
 	qdisc_watchdog_init(&q->watchdog, sch);
@@ -1022,9 +1043,23 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt,
 	if (gopt->version != HTB_VER >> 16)
 		return -EINVAL;
 
+	q->offload = nla_get_flag(tb[TCA_HTB_OFFLOAD]);
+
+	if (q->offload) {
+		if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
+			return -EOPNOTSUPP;
+
+		q->num_direct_qdiscs = dev->real_num_tx_queues;
+		q->direct_qdiscs = kcalloc(q->num_direct_qdiscs,
+					   sizeof(*q->direct_qdiscs),
+					   GFP_KERNEL);
+		if (!q->direct_qdiscs)
+			return -ENOMEM;
+	}
+
 	err = qdisc_class_hash_init(&q->clhash);
 	if (err < 0)
-		return err;
+		goto err_free_direct_qdiscs;
 
 	qdisc_skb_head_init(&q->direct_queue);
 
@@ -1037,7 +1072,106 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt,
 		q->rate2quantum = 1;
 	q->defcls = gopt->defcls;
 
+	if (!q->offload)
+		return 0;
+
+	for (ntx = 0; ntx < q->num_direct_qdiscs; ntx++) {
+		struct netdev_queue *dev_queue = netdev_get_tx_queue(dev, ntx);
+		struct Qdisc *qdisc;
+
+		qdisc = qdisc_create_dflt(dev_queue, &pfifo_qdisc_ops,
+					  TC_H_MAKE(sch->handle, 0), extack);
+		if (!qdisc) {
+			err = -ENOMEM;
+			goto err_free_qdiscs;
+		}
+
+		htb_set_lockdep_class_child(qdisc);
+		q->direct_qdiscs[ntx] = qdisc;
+		qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
+	}
+
+	sch->flags |= TCQ_F_MQROOT;
+
+	offload_opt = (struct tc_htb_qopt_offload) {
+		.command = TC_HTB_CREATE,
+		.parent_classid = TC_H_MAJ(sch->handle) >> 16,
+		.classid = TC_H_MIN(q->defcls),
+	};
+	err = htb_offload(dev, &offload_opt);
+	if (err)
+		goto err_free_qdiscs;
+
 	return 0;
+
+err_free_qdiscs:
+	/* TC_HTB_CREATE call failed, avoid any further calls to the driver. */
+	q->offload = false;
+
+	for (ntx = 0; ntx < q->num_direct_qdiscs && q->direct_qdiscs[ntx];
+	     ntx++)
+		qdisc_put(q->direct_qdiscs[ntx]);
+
+	qdisc_class_hash_destroy(&q->clhash);
+	/* Prevent use-after-free and double-free when htb_destroy gets called.
+	 */
+	q->clhash.hash = NULL;
+	q->clhash.hashsize = 0;
+
+err_free_direct_qdiscs:
+	kfree(q->direct_qdiscs);
+	q->direct_qdiscs = NULL;
+	return err;
+}
+
+static void htb_attach_offload(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct htb_sched *q = qdisc_priv(sch);
+	unsigned int ntx;
+
+	for (ntx = 0; ntx < q->num_direct_qdiscs; ntx++) {
+		struct Qdisc *old, *qdisc = q->direct_qdiscs[ntx];
+
+		old = dev_graft_qdisc(qdisc->dev_queue, qdisc);
+		qdisc_put(old);
+		qdisc_hash_add(qdisc, false);
+	}
+	for (ntx = q->num_direct_qdiscs; ntx < dev->num_tx_queues; ntx++) {
+		struct netdev_queue *dev_queue = netdev_get_tx_queue(dev, ntx);
+		struct Qdisc *old = dev_graft_qdisc(dev_queue, NULL);
+
+		qdisc_put(old);
+	}
+
+	kfree(q->direct_qdiscs);
+	q->direct_qdiscs = NULL;
+}
+
+static void htb_attach_software(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned int ntx;
+
+	/* Resemble qdisc_graft behavior. */
+	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+		struct netdev_queue *dev_queue = netdev_get_tx_queue(dev, ntx);
+		struct Qdisc *old = dev_graft_qdisc(dev_queue, sch);
+
+		qdisc_refcount_inc(sch);
+
+		qdisc_put(old);
+	}
+}
+
+static void htb_attach(struct Qdisc *sch)
+{
+	struct htb_sched *q = qdisc_priv(sch);
+
+	if (q->offload)
+		htb_attach_offload(sch);
+	else
+		htb_attach_software(sch);
 }
 
 static int htb_dump(struct Qdisc *sch, struct sk_buff *skb)
@@ -1046,6 +1180,11 @@ static int htb_dump(struct Qdisc *sch, struct sk_buff *skb)
 	struct nlattr *nest;
 	struct tc_htb_glob gopt;
 
+	if (q->offload)
+		sch->flags |= TCQ_F_OFFLOADED;
+	else
+		sch->flags &= ~TCQ_F_OFFLOADED;
+
 	sch->qstats.overlimits = q->overlimits;
 	/* Its safe to not acquire qdisc lock. As we hold RTNL,
 	 * no change can happen on the qdisc parameters.
@@ -1063,6 +1202,8 @@ static int htb_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put(skb, TCA_HTB_INIT, sizeof(gopt), &gopt) ||
 	    nla_put_u32(skb, TCA_HTB_DIRECT_QLEN, q->direct_qlen))
 		goto nla_put_failure;
+	if (q->offload && nla_put_flag(skb, TCA_HTB_OFFLOAD))
+		goto nla_put_failure;
 
 	return nla_nest_end(skb, nest);
 
@@ -1144,19 +1285,97 @@ htb_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
 	return gnet_stats_copy_app(d, &cl->xstats, sizeof(cl->xstats));
 }
 
+static struct netdev_queue *
+htb_select_queue(struct Qdisc *sch, struct tcmsg *tcm)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct tc_htb_qopt_offload offload_opt;
+	int err;
+
+	offload_opt = (struct tc_htb_qopt_offload) {
+		.command = TC_HTB_LEAF_QUERY_QUEUE,
+		.classid = TC_H_MIN(tcm->tcm_parent),
+	};
+	err = htb_offload(dev, &offload_opt);
+	if (err || offload_opt.qid >= dev->num_tx_queues)
+		return NULL;
+	return netdev_get_tx_queue(dev, offload_opt.qid);
+}
+
+static struct Qdisc *
+htb_graft_helper(struct netdev_queue *dev_queue, struct Qdisc *new_q)
+{
+	struct net_device *dev = dev_queue->dev;
+	struct Qdisc *old_q;
+
+	if (dev->flags & IFF_UP)
+		dev_deactivate(dev);
+	old_q = dev_graft_qdisc(dev_queue, new_q);
+	if (new_q)
+		new_q->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
+	if (dev->flags & IFF_UP)
+		dev_activate(dev);
+
+	return old_q;
+}
+
+static void htb_offload_move_qdisc(struct Qdisc *sch, u16 qid_old, u16 qid_new)
+{
+	struct netdev_queue *queue_old, *queue_new;
+	struct net_device *dev = qdisc_dev(sch);
+	struct Qdisc *qdisc;
+
+	queue_old = netdev_get_tx_queue(dev, qid_old);
+	queue_new = netdev_get_tx_queue(dev, qid_new);
+
+	if (dev->flags & IFF_UP)
+		dev_deactivate(dev);
+	qdisc = dev_graft_qdisc(queue_old, NULL);
+	qdisc->dev_queue = queue_new;
+	qdisc = dev_graft_qdisc(queue_new, qdisc);
+	if (dev->flags & IFF_UP)
+		dev_activate(dev);
+
+	WARN_ON(!(qdisc->flags & TCQ_F_BUILTIN));
+}
+
 static int htb_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
 		     struct Qdisc **old, struct netlink_ext_ack *extack)
 {
+	struct netdev_queue *dev_queue = sch->dev_queue;
 	struct htb_class *cl = (struct htb_class *)arg;
+	struct htb_sched *q = qdisc_priv(sch);
+	struct Qdisc *old_q;
 
 	if (cl->level)
 		return -EINVAL;
-	if (new == NULL &&
-	    (new = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
-				     cl->common.classid, extack)) == NULL)
-		return -ENOBUFS;
+
+	if (q->offload) {
+		dev_queue = new->dev_queue;
+		WARN_ON(dev_queue != cl->leaf.q->dev_queue);
+	}
+
+	if (!new) {
+		new = qdisc_create_dflt(dev_queue, &pfifo_qdisc_ops,
+					cl->common.classid, extack);
+		if (!new)
+			return -ENOBUFS;
+	}
+
+	if (q->offload) {
+		htb_set_lockdep_class_child(new);
+		/* One ref for cl->leaf.q, the other for dev_queue->qdisc. */
+		qdisc_refcount_inc(new);
+		old_q = htb_graft_helper(dev_queue, new);
+	}
 
 	*old = qdisc_replace(sch, new, &cl->leaf.q);
+
+	if (q->offload) {
+		WARN_ON(old_q != *old);
+		qdisc_put(old_q);
+	}
+
 	return 0;
 }
 
@@ -1184,9 +1403,10 @@ static inline int htb_parent_last_child(struct htb_class *cl)
 	return 1;
 }
 
-static void htb_parent_to_leaf(struct htb_sched *q, struct htb_class *cl,
+static void htb_parent_to_leaf(struct Qdisc *sch, struct htb_class *cl,
 			       struct Qdisc *new_q)
 {
+	struct htb_sched *q = qdisc_priv(sch);
 	struct htb_class *parent = cl->parent;
 
 	WARN_ON(cl->level || !cl->leaf.q || cl->prio_activity);
@@ -1204,6 +1424,61 @@ static void htb_parent_to_leaf(struct htb_sched *q, struct htb_class *cl,
 	parent->cmode = HTB_CAN_SEND;
 }
 
+static void htb_parent_to_leaf_offload(struct Qdisc *sch,
+				       struct netdev_queue *dev_queue,
+				       struct Qdisc *new_q)
+{
+	struct Qdisc *old_q;
+
+	/* One ref for cl->leaf.q, the other for dev_queue->qdisc. */
+	qdisc_refcount_inc(new_q);
+	old_q = htb_graft_helper(dev_queue, new_q);
+	WARN_ON(!(old_q->flags & TCQ_F_BUILTIN));
+}
+
+static void htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
+				      bool last_child, bool destroying)
+{
+	struct tc_htb_qopt_offload offload_opt;
+	struct Qdisc *q = cl->leaf.q;
+	struct Qdisc *old = NULL;
+
+	if (cl->level)
+		return;
+
+	WARN_ON(!q);
+	if (!destroying) {
+		/* On destroy of HTB, two cases are possible:
+		 * 1. q is a normal qdisc, but q->dev_queue has noop qdisc.
+		 * 2. q is a noop qdisc (for nodes that were inner),
+		 *    q->dev_queue is noop_netdev_queue.
+		 */
+		old = htb_graft_helper(q->dev_queue, NULL);
+		WARN_ON(!old);
+		WARN_ON(old != q);
+	}
+
+	offload_opt = (struct tc_htb_qopt_offload) {
+		.command = last_child ? TC_HTB_LEAF_DEL_LAST : TC_HTB_LEAF_DEL,
+		.classid = cl->common.classid,
+	};
+	htb_offload(qdisc_dev(sch), &offload_opt);
+
+	qdisc_put(old);
+
+	if (last_child)
+		return;
+
+	if (offload_opt.moved_qid != 0) {
+		if (destroying)
+			q->dev_queue = netdev_get_tx_queue(qdisc_dev(sch),
+							   offload_opt.qid);
+		else
+			htb_offload_move_qdisc(sch, offload_opt.moved_qid,
					       offload_opt.qid);
+	}
+}
+
 static void htb_destroy_class(struct Qdisc *sch, struct htb_class *cl)
 {
 	if (!cl->level) {
@@ -1217,8 +1492,11 @@ static void htb_destroy_class(struct Qdisc *sch, struct htb_class *cl)
 
 static void htb_destroy(struct Qdisc *sch)
 {
+	struct net_device *dev = qdisc_dev(sch);
+	struct tc_htb_qopt_offload offload_opt;
 	struct htb_sched *q = qdisc_priv(sch);
 	struct hlist_node *next;
+	bool nonempty, changed;
 	struct htb_class *cl;
 	unsigned int i;
 
@@ -1237,13 +1515,58 @@ static void htb_destroy(struct Qdisc *sch)
 			cl->block = NULL;
 		}
 	}
-	for (i = 0; i < q->clhash.hashsize; i++) {
-		hlist_for_each_entry_safe(cl, next, &q->clhash.hash[i],
-					  common.hnode)
-			htb_destroy_class(sch, cl);
-	}
+
+	do {
+		nonempty = false;
+		changed = false;
+		for (i = 0; i < q->clhash.hashsize; i++) {
+			hlist_for_each_entry_safe(cl, next,
+						  &q->clhash.hash[i],
+						  common.hnode) {
+				bool last_child;
+
+				if (!q->offload) {
+					htb_destroy_class(sch, cl);
+					continue;
+				}
+
+				nonempty = true;
+
+				if (cl->level)
+					continue;
+
+				changed = true;
+
+				last_child = htb_parent_last_child(cl);
+				htb_destroy_class_offload(sch, cl, last_child,
+							  true);
+				qdisc_class_hash_remove(&q->clhash,
+							&cl->common);
+				if (cl->parent)
+					cl->parent->children--;
+				if (last_child)
+					htb_parent_to_leaf(sch, cl, NULL);
+				htb_destroy_class(sch, cl);
+			}
+		}
+	} while (changed);
+	WARN_ON(nonempty);
+
 	qdisc_class_hash_destroy(&q->clhash);
 	__qdisc_reset_queue(&q->direct_queue);
+
+	if (!q->offload)
+		return;
+
+	offload_opt = (struct tc_htb_qopt_offload) {
+		.command = TC_HTB_DESTROY,
+	};
+	htb_offload(dev, &offload_opt);
+
+	if (!q->direct_qdiscs)
+		return;
+	for (i = 0; i < q->num_direct_qdiscs && q->direct_qdiscs[i]; i++)
+		qdisc_put(q->direct_qdiscs[i]);
+	kfree(q->direct_qdiscs);
 }
 
 static int htb_delete(struct Qdisc *sch, unsigned long arg)
@@ -1260,11 +1583,24 @@ static int htb_delete(struct Qdisc *sch, unsigned long arg)
 	if (cl->children || cl->filter_cnt)
 		return -EBUSY;
 
-	if (!cl->level && htb_parent_last_child(cl)) {
-		new_q = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+	if (!cl->level && htb_parent_last_child(cl))
+		last_child = 1;
+
+	if (q->offload)
+		htb_destroy_class_offload(sch, cl, last_child, false);
+
+	if (last_child) {
+		struct netdev_queue *dev_queue;
+
+		dev_queue = q->offload ? cl->leaf.q->dev_queue : sch->dev_queue;
+		new_q = qdisc_create_dflt(dev_queue, &pfifo_qdisc_ops,
 					  cl->parent->common.classid,
 					  NULL);
-		last_child = 1;
+		if (q->offload) {
+			if (new_q)
+				htb_set_lockdep_class_child(new_q);
+			htb_parent_to_leaf_offload(sch, dev_queue, new_q);
+		}
 	}
 
 	sch_tree_lock(sch);
@@ -1285,7 +1621,7 @@ static int htb_delete(struct Qdisc *sch, unsigned long arg)
 				  &q->hlevel[cl->level].wait_pq);
 
 	if (last_child)
-		htb_parent_to_leaf(q, cl, new_q);
+		htb_parent_to_leaf(sch, cl, new_q);
 
 	sch_tree_unlock(sch);
 
@@ -1300,9 +1636,11 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 	int err = -EINVAL;
 	struct htb_sched *q = qdisc_priv(sch);
 	struct htb_class *cl = (struct htb_class *)*arg, *parent;
+	struct tc_htb_qopt_offload offload_opt;
 	struct nlattr *opt = tca[TCA_OPTIONS];
 	struct nlattr *tb[TCA_HTB_MAX + 1];
 	struct Qdisc *parent_qdisc = NULL;
+	struct netdev_queue *dev_queue;
 	struct tc_htb_opt *hopt;
 	u64 rate64, ceil64;
 	int warn = 0;
@@ -1335,8 +1673,12 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 	qdisc_put_rtab(qdisc_get_rtab(&hopt->ceil, tb[TCA_HTB_CTAB],
 				      NULL));
 
+	rate64 = tb[TCA_HTB_RATE64] ? nla_get_u64(tb[TCA_HTB_RATE64]) : 0;
+	ceil64 = tb[TCA_HTB_CEIL64] ? nla_get_u64(tb[TCA_HTB_CEIL64]) : 0;
+
 	if (!cl) {		/* new class */
-		struct Qdisc *new_q;
+		struct net_device *dev = qdisc_dev(sch);
+		struct Qdisc *new_q, *old_q;
 		int prio;
 		struct {
 			struct nlattr		nla;
@@ -1379,11 +1721,8 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 						NULL,
 						qdisc_root_sleeping_running(sch),
 						tca[TCA_RATE] ? : &est.nla);
-			if (err) {
-				tcf_block_put(cl->block);
-				kfree(cl);
-				goto failure;
-			}
+			if (err)
+				goto err_block_put;
 		}
 
 		cl->children = 0;
@@ -1392,12 +1731,72 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 		for (prio = 0; prio < TC_HTB_NUMPRIO; prio++)
 			RB_CLEAR_NODE(&cl->node[prio]);
 
+		cl->common.classid = classid;
+
+		/* Make sure nothing interrupts us in between of two
+		 * ndo_setup_tc calls.
+		 */
+		ASSERT_RTNL();
+
 		/* create leaf qdisc early because it uses kmalloc(GFP_KERNEL)
 		 * so that can't be used inside of sch_tree_lock
 		 * -- thanks to Karlis Peisenieks
 		 */
-		new_q = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+		if (!q->offload) {
+			dev_queue = sch->dev_queue;
+		} else if (!(parent && !parent->level)) {
+			/* Assign a dev_queue to this classid. */
+			offload_opt = (struct tc_htb_qopt_offload) {
+				.command = TC_HTB_LEAF_ALLOC_QUEUE,
+				.classid = cl->common.classid,
+				.parent_classid = parent ?
+					TC_H_MIN(parent->common.classid) :
+					TC_HTB_CLASSID_ROOT,
+				.rate = max_t(u64, hopt->rate.rate, rate64),
+				.ceil = max_t(u64, hopt->ceil.rate, ceil64),
+			};
+			err = htb_offload(dev, &offload_opt);
+			if (err) {
+				pr_err("htb: TC_HTB_LEAF_ALLOC_QUEUE failed with err = %d\n",
+				       err);
+				goto err_kill_estimator;
+			}
+			dev_queue = netdev_get_tx_queue(dev, offload_opt.qid);
+		} else { /* First child. */
+			dev_queue = parent->leaf.q->dev_queue;
+			old_q = htb_graft_helper(dev_queue, NULL);
+			WARN_ON(old_q != parent->leaf.q);
+			offload_opt = (struct tc_htb_qopt_offload) {
+				.command = TC_HTB_LEAF_TO_INNER,
+				.classid = cl->common.classid,
+				.parent_classid =
+					TC_H_MIN(parent->common.classid),
+				.rate = max_t(u64, hopt->rate.rate, rate64),
+				.ceil = max_t(u64, hopt->ceil.rate, ceil64),
+			};
+			err = htb_offload(dev, &offload_opt);
+			if (err) {
+				pr_err("htb: TC_HTB_LEAF_TO_INNER failed with err = %d\n",
+				       err);
+				htb_graft_helper(dev_queue, old_q);
+				goto err_kill_estimator;
+			}
+			qdisc_put(old_q);
+		}
+		new_q = qdisc_create_dflt(dev_queue, &pfifo_qdisc_ops,
 					  classid, NULL);
+		if (q->offload) {
+			if (new_q) {
+				htb_set_lockdep_class_child(new_q);
+				/* One ref for cl->leaf.q, the other for
+				 * dev_queue->qdisc.
+				 */
+				qdisc_refcount_inc(new_q);
+			}
+			old_q = htb_graft_helper(dev_queue, new_q);
+			/* No qdisc_put needed. */
+			WARN_ON(!(old_q->flags & TCQ_F_BUILTIN));
+		}
 		sch_tree_lock(sch);
 		if (parent && !parent->level) {
 			/* turn parent into inner node */
@@ -1415,10 +1814,10 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 					: TC_HTB_MAXDEPTH) - 1;
 			memset(&parent->inner, 0, sizeof(parent->inner));
 		}
+
 		/* leaf (we) needs elementary qdisc */
 		cl->leaf.q = new_q ? new_q : &noop_qdisc;
 
-		cl->common.classid = classid;
 		cl->parent = parent;
 
 		/* set class to be in HTB_CAN_SEND state */
@@ -1444,12 +1843,29 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 			if (err)
 				return err;
 		}
-		sch_tree_lock(sch);
-	}
 
-	rate64 = tb[TCA_HTB_RATE64] ? nla_get_u64(tb[TCA_HTB_RATE64]) : 0;
+		if (q->offload) {
+			struct net_device *dev = qdisc_dev(sch);
 
-	ceil64 = tb[TCA_HTB_CEIL64] ? nla_get_u64(tb[TCA_HTB_CEIL64]) : 0;
+			offload_opt = (struct tc_htb_qopt_offload) {
+				.command = TC_HTB_NODE_MODIFY,
+				.classid = cl->common.classid,
+				.rate = max_t(u64, hopt->rate.rate, rate64),
+				.ceil = max_t(u64, hopt->ceil.rate, ceil64),
+			};
+			err = htb_offload(dev, &offload_opt);
+			if (err)
+				/* Estimator was replaced, and rollback may fail
+				 * as well, so we don't try to recover it, and
+				 * the estimator won't work property with the
+				 * offload anyway, because bstats are updated
+				 * only when the stats are queried.
+				 */
+				return err;
+		}
+
+		sch_tree_lock(sch);
+	}
 
 	psched_ratecfg_precompute(&cl->rate, &hopt->rate, rate64);
 	psched_ratecfg_precompute(&cl->ceil, &hopt->ceil, ceil64);
@@ -1492,6 +1908,11 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 	*arg = (unsigned long)cl;
 	return 0;
 
+err_kill_estimator:
+	gen_kill_estimator(&cl->rate_est);
+err_block_put:
+	tcf_block_put(cl->block);
+	kfree(cl);
 failure:
 	return err;
 }
@@ -1557,6 +1978,7 @@ static void htb_walk(struct Qdisc *sch, struct qdisc_walker *arg)
 }
 
 static const struct Qdisc_class_ops htb_class_ops = {
+	.select_queue	=	htb_select_queue,
 	.graft		=	htb_graft,
 	.leaf		=	htb_leaf,
 	.qlen_notify	=	htb_qlen_notify,
@@ -1579,6 +2001,7 @@ static struct Qdisc_ops htb_qdisc_ops __read_mostly = {
 	.dequeue	=	htb_dequeue,
 	.peek		=	qdisc_peek_dequeued,
 	.init		=	htb_init,
+	.attach		=	htb_attach,
 	.reset		=	htb_reset,
 	.destroy	=	htb_destroy,
 	.dump		=	htb_dump,
diff --git a/tools/include/uapi/linux/pkt_sched.h b/tools/include/uapi/linux/pkt_sched.h
index 0d18b1d1fbbc..5c903abc9fa5 100644
--- a/tools/include/uapi/linux/pkt_sched.h
+++ b/tools/include/uapi/linux/pkt_sched.h
@@ -414,6 +414,7 @@ enum {
 	TCA_HTB_RATE64,
 	TCA_HTB_CEIL64,
 	TCA_HTB_PAD,
+	TCA_HTB_OFFLOAD,
 	__TCA_HTB_MAX,
 };